Many-core chip multiprocessors offer high parallel processing power for big 
data analytics; however, they require an efficient multi-level cache and 
interconnection to achieve high system throughput. Using fast on-chip private 
first-level L1 and second-level L2 caches per core is expensive for a large 
number of cores. In this paper, for a moderate number of cores from 16 to 64, 
we present a cost- and performance-efficient multi-level cache system with a 
per-core L1 and a last-level shared bus cache on each bus line of a cost- 
1. INTRODUCTION 

In recent years, many-core processors have emerged as an on-chip computing platform [1]-[3] that can 
provide massive computational power for heterogeneous computing environments for big data [4] and other 
compute-intensive embedded artificial intelligence applications [5]. Some recent work [6]-[9] on 
high-performance computing for big data has focused on processing frameworks, architecture synthesis and 
utilization of multiple cores. With increased very large-scale integration (VLSI) density, it may still be 
manageable to provide heterogeneous computing using a cost-effective on-chip interconnection and cache 
memory system. Past research on bus-based interconnection for large parallel processing systems [10] 
determined that a regular multiple-bus interconnection that uses a number of buses equal to one-half of the 
number of cores or memory modules gives comparable memory bandwidth. However, this reduced bus 
interconnection is costly for a chip multiprocessor (CMP) due to the large number of bus-core/memory 
connections. In our earlier research, we proposed a cost-effective interconnection using geometrical 
patterns for bus-core/memory connections [11] with a reduced number of buses. The approach in [11] was 
extended to a system-level configuration defined with three geometrical system configurations, termed 
geometrical bus interconnection (GBI) [12], for bus-memory connections using the rhombic connection 
pattern as the base. We achieved cost savings from 1.8x to 2.4x with GBI compared to the regular reduced 
bus interconnection. 
However, as the overall throughput of the many-core CMP is also determined by the cache system 
performance, achieving high overall CMP throughput with a cost- and performance-efficient interconnection 
and cache system is highly desirable today. 

Providing adequate and sustained many-core CMP throughput becomes more challenging as it also 
requires an efficient cache system solution. To address this challenge, our focus is to present a 
cost-effective multi-level cache system that improves the overall many-core CMP throughput using the 
comparable memory bandwidth results from the cost-effective GBI [12]. A typical multi-level cache 
hierarchy for multi-core systems, as shown in Figure 1, has private L1 and L2 caches per core at levels 1 
and 2, and a shared L3 cache as the last level cache (LLC) at level 3. For example, a current mainstream 
commercial multi-core processor such as the Intel® Core™ i5 has three levels of cache: a per-core L1 with 
separate instruction and data caches, a per-core unified (instruction/data) L2 cache, and a shared L3 cache 
as the LLC (shared by all cores). 


Figure 1. Traditional multi-level cache system with L1, L2 and L3 for multi-core CMP 


Adding a large number of fast on-chip private L1 and L2 caches per core along with a shared L3 may 
increase the cache system cost. We therefore propose an alternative solution that combines L1 with a 
relatively slower shared bus cache (SBC) as the LLC, added to every bus line of GBI [12], in which the data 
requests of all cores are shared via GBI. In addition, our proposed cache system may also allow the number 
of cache levels and the cache sizes within the hierarchy to be increased through cache reconfiguration, in 
order to optimize the system for cost, performance and power consumption. 

Earlier research [13]-[16] has addressed various cache system architectures, issues and solutions for 
improved performance. In [13], the authors analyzed memory performance for tiled many-core CMPs. Lin et 
al. [14] suggested hybrid cache systems that include layers of cache architecture from memory to database 
to improve performance for specific relational database queries in big data applications. Charles et al. 
[15] looked at cache reconfiguration for network-on-chip (NoC) based many-core CMPs. Safayenikoo et al. 
[16] suggested an energy-efficient cache architecture to address the increased leakage power resulting from 
the large area of the LLC (as much as 50% of the chip area) due to its increased size. Most of the work 
reported in [13]-[16] may require a complex cache design process. Our proposed cache system is simple and 
does not add any extra or difficult cache design steps. Our main contributions in this paper are as 
follows: i) we propose a shared bus cache (SBC) within a multi-level cache system; ii) we present a least 
recently used (LRU) multi-level cache system simulation to extract hit and miss concurrencies; iii) we 
apply concurrent average memory access time (C-AMAT) [17] to accurately determine the system throughput 
performance and present our results; and iv) we provide conclusions and some insight into future 
research. 


2. L1-SBC CACHE SYSTEM 

Figure 2 shows a system with L1 and a shared bus cache at every bus line of GBI [12]. Throughout this 
paper, we refer to the memory system using only the L1 private cache as L1, the system with L1 and L2 as 
L12, and the system with L1 and the shared bus cache as L1-SBC. 
Figure 2. L1-SBC 


2.1. Concurrent average memory access time (C-AMAT) 

Several cache techniques [18]-[20] were suggested earlier for improving the traditional average memory 
access time of multi-level cache systems. In [18], hardware prefetching was considered to exploit the 
spatial and temporal locality of references. In [19], multi-level caches were considered as primary and 
secondary memories for proxy servers accessing web content. In [20], an LRU replacement policy was proposed 
that is aware of the cache miss penalty, so that memory access latency is balanced for a “hybrid” memory 
system built from different memory technologies. The techniques in [18]-[20] attempt to reduce average 
memory access time without considering any cost implications. Our approach is to optimize cache and 
interconnection cost across the cache levels and to apply C-AMAT, which exploits the parallel concurrency 
in cache hits and misses to determine the average memory access time more accurately across all levels of 
data access. An analytical method for determining C-AMAT is briefly provided below. The traditional average 
memory access time (AMAT) of a multi-level cache system is given in (1) and (2) for the L1 and L12 cache 
systems respectively. 


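$$\text{AMAT}_1 = t_1 h_1 + (1 - h_1)\, t_m \qquad (1)$$

$$\text{AMAT}_2 = t_1 h_1 + (1 - h_1)\left(t_2 h_2 + (1 - h_2)\, t_m\right) \qquad (2)$$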


where t1 and t2 are the cache access times for the level 1 and level 2 caches, h1 and h2 are the cache hit 
ratios for the level 1 and level 2 caches, and tm is the global memory access time. In our approach, we 
exploit the hit and miss concurrency of the cores and the SBC, supported by GBI, and apply C-AMAT for 
performance evaluation. Hit concurrency improves performance, while a cache miss may degrade memory system 
performance, depending on the hit concurrency. By taking advantage of multiple buses for miss concurrency, 
higher system performance can be achieved. However, the application of C-AMAT needs to ensure that the miss 
concurrency does not exceed the interconnection bandwidth available with the reduced number of buses. 
Thus, we rewrite (1) and (2) as (3) and (4). 


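$$\text{C-AMAT}_1 = \frac{t_1 h_1}{c_{h1}} + (1 - h_1)\frac{t_m}{c_m} \qquad (3)$$

$$\text{C-AMAT}_2 = \frac{t_1 h_1}{c_{h1}} + (1 - h_1)\left(\frac{t_2 h_2}{c_{h2}} + (1 - h_2)\frac{t_m}{c_m}\right) \qquad (4)$$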


where ch1 and ch2 are the average hit cycle concurrencies at levels 1 and 2, and cm is the average miss 
cycle concurrency. In this paper, we evaluate the L1, L12 and L1-SBC systems. We selected the minimum 
number of L1 and SBC cache blocks needed to meet the following criterion for hit and miss concurrency, 
given as (5). 

Ch SA Cm <n/2 (5) 


Since the GBI interconnection provides a memory bandwidth of n/2, we can also approximate (4) 
by miss concurrency supported by the GBI memory bandwidth as (6). 


$$\text{C-AMAT} = \frac{t_1 h_1}{c_{h1}} + (1 - h_1)\left(\frac{t_2 h_2}{c_{h2}} + (1 - h_2)\frac{2\, t_m}{n}\right) \qquad (6)$$

When cm is less than n/2, the interconnection bandwidth is not fully utilized. The C-AMAT given by (6) 
is therefore smaller than the value obtained with the conservative miss concurrency in (4). The percentage 
deviation between (4) and (6) varies from 4 % to 30 % across all cache systems. The deviation is higher for 
the L1-SBC system because its miss concurrency decreases as a result of the higher hit concurrency provided 
by the bus cache during the read cycle. In this paper, we include only the conservative results from (3) 
and (4) for the L1 and L12 cache systems respectively, while ensuring that criterion (5) is met. 
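
The relationship between (1), (3), (4) and (6) can be illustrated with a minimal Python sketch. The cycle counts are taken from Table 2; the function names are hypothetical, and any hit ratios or concurrency values supplied by a caller are illustrative assumptions rather than results from this paper.

```python
# Cycle counts follow Table 2; hit ratios and concurrency values passed in
# by the caller are illustrative assumptions, not measured results.
T1, T2, TM = 5, 10, 100  # L1, L2 and global memory access cycles


def amat1(h1, t1=T1, tm=TM):
    """Sequential AMAT for an L1-only system, following (1)."""
    return t1 * h1 + (1 - h1) * tm


def c_amat1(h1, ch1, cm, t1=T1, tm=TM):
    """C-AMAT for an L1-only system, following (3)."""
    return t1 * h1 / ch1 + (1 - h1) * tm / cm


def c_amat2(h1, h2, ch1, ch2, cm, t1=T1, t2=T2, tm=TM):
    """C-AMAT for a two-level system, following (4)."""
    return t1 * h1 / ch1 + (1 - h1) * (t2 * h2 / ch2 + (1 - h2) * tm / cm)


def c_amat_gbi(h1, h2, ch1, ch2, n, t1=T1, t2=T2, tm=TM):
    """Approximation (6): miss concurrency set to the GBI bandwidth n/2."""
    return c_amat2(h1, h2, ch1, ch2, cm=n / 2, t1=t1, t2=t2, tm=tm)
```

With all concurrencies equal to 1, `c_amat1` reduces to `amat1`, which is a quick sanity check on the formulas. For an L1-SBC system, the SBC access time from Table 2 would be passed as `t2=25`.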


2.2. Geometrical bus interconnection (GBI) [12] cost 

Table 1 gives the average normalized interconnection cost of GBI compared to the fully reduced 
multiple-bus system [10]. We observe a cost reduction of roughly 31 % to 36 % as the number of cores 
increases from 16 to 64. 
Table 1. Normalized average GBI cost compared to fully reduced bus system [10] 

No. of cores    Normalized cost 
16              0.69 
32              0.66 
64              0.64 


2.3. SBC impact on C-AMAT 

In the past, shared cache techniques such as [21] have looked at sharing cache ways based on hash 
mapping instead of traditional cache set sharing for multi-core platforms. In general, increasing the 
number of processor cores can directly increase last level cache (LLC) hit and miss concurrency, giving a 
reduced C-AMAT. As our system uses a number of buses equal to one-half the number of cores, a memory access 
that misses in the per-core cache is searched in the SBC. Since the reduced number of shared buses in our 
approach naturally captures all core accesses via the bus interconnection, placing an SBC at each bus line 
of GBI closely resembles the traditional shared L3 cache normally used in current commercial processor 
systems. As we used n/2 SBCs at level 2, any miss in L1 increases the hit concurrency in the SBC. In our 
approach, we counted only pure miss concurrency [17]: a miss is counted only if none of the bus caches has 
a hit in the hit cycle. 
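
The pure-miss accounting described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' simulator: the per-cycle trace format and the function name `concurrency_from_trace` are hypothetical.

```python
def concurrency_from_trace(trace):
    """Return (average hit concurrency, average pure miss concurrency).

    trace is a list of (hits, misses) pairs, one per cycle: the number of
    accesses hitting in that cycle and the number of outstanding misses.
    This trace format is an assumption made for illustration only.
    """
    hit_cycles = [h for h, _ in trace if h > 0]
    # A cycle counts toward pure miss concurrency only when no hit is active.
    pure_miss_cycles = [m for h, m in trace if h == 0 and m > 0]
    ch = sum(hit_cycles) / len(hit_cycles) if hit_cycles else 0.0
    cm = sum(pure_miss_cycles) / len(pure_miss_cycles) if pure_miss_cycles else 0.0
    return ch, cm
```

In this sketch, a cycle with both a hit and outstanding misses contributes only to the hit concurrency, so overlapping misses are hidden behind hits rather than charged as miss cycles.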


2.4. Cache association impact on C-AMAT 

Cache association can also impact our solution. The authors of [22], [23] noted that higher cache 
association normally increases the cache hit rate, but at the expense of hardware complexity in the cache 
controller and additional latency in cache search time. In our approach, however, the association was 
selected to ensure that criterion (5) is satisfied. Thus, selecting a direct-mapped cache may help to 
achieve a reduced C-AMAT. In general, miss concurrency in the LLC can be supported by multi-ported memory, 
or by multi-bank memory (memory modules) with a single bus. For a single-bus system, however, bus 
contention limits the throughput. Miss concurrency can instead be facilitated by using multi-bank memory 
modules with a multiple-bus interconnection between the shared cache and the memory modules. In our case, 
miss concurrency is supported by the multiple buses in GBI, yielding a lower C-AMAT. 


3. CACHE SYSTEM SIMULATION 
3.1. System operation with L1-SBC 

Figure 2 shows the operation flowchart for the read and write cycles of the L1-SBC system. The SBC is 
used only during the read cycle, with a write-through policy to update on a cache miss. In “normal no-fault 
mode”, during a read cycle the data is first searched in L1. If the L1 read is a miss, the data is then 
searched in the SBC. If it is a hit, the data is cached. On a read miss in the SBC, the buses in GBI are 
arbitrated to utilize the full memory bandwidth, and the data is read from the global memory module and 
written to both the SBC and the L1 cache. If the bus that is currently granted fails, the cache system 
switches to “bus fault mode” and the interconnection is re-arbitrated to use the other b-1 connected buses. 
After bus re-arbitration, the data is searched again, first in L1 and, if it is a hit, the data is cached 
in the L1 cache; otherwise it is searched in the SBC. During a write cycle, if the L1 cache block is 
present, the data is written into the L1 cache. On an L1 write miss, the L1 cache block is replaced and the 
data is updated in L1; the data is then written to global memory using the arbitrated buses in GBI. 
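
The no-fault read and write paths described above can be sketched in Python as follows. This is a simplified illustration, not the simulator used in the paper: the `LRU` class is a stand-in built on `collections.OrderedDict`, the names `read` and `write` are hypothetical, filling L1 on an SBC hit is an assumption, and bus arbitration and the bus fault mode are omitted.

```python
from collections import OrderedDict


class LRU:
    """Tiny LRU cache of block addresses (a stand-in for any LRU library)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def lookup(self, addr):
        if addr in self.blocks:
            self.blocks.move_to_end(addr)  # mark as most recently used
            return True
        return False

    def fill(self, addr):
        self.blocks[addr] = True
        self.blocks.move_to_end(addr)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used


def read(addr, l1, sbc):
    """Return where a read is served from: 'L1', 'SBC' or 'MEM'."""
    if l1.lookup(addr):
        return "L1"
    if sbc.lookup(addr):
        l1.fill(addr)  # assumption: an SBC hit also fills L1
        return "SBC"
    # Miss in both levels: fetch from global memory, then fill SBC and L1.
    sbc.fill(addr)
    l1.fill(addr)
    return "MEM"


def write(addr, l1):
    """Write path: update L1, allocating on a miss; the SBC is bypassed."""
    hit = l1.lookup(addr)
    if not hit:
        l1.fill(addr)
    return "L1" if hit else "L1-miss"
```

Counting how many accesses return each label in a given cycle is one way hit and miss concurrencies could be extracted from such a model.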

The proposed cache system was simulated using the publicly available “lrucache” libraries in Python. 
We created multiple indexed “lrucache” objects to implement L1, L2 and SBC. We iterated the cache 
operations over n x 1000 iterations for n cores. Table 2 shows the general parameters used for the 
simulation. Drawing on insight into today’s memory technologies, we used approximate relative bit costs for 
L1, L2 and SBC as given in Table 2. 
Figure 2. L1-SBC Operation 

Table 2. Cache system simulation parameters 

Parameter                            Value 
Clock cycle                          0.5 ns (2 GHz) 
L1 access cycles (t1)                5 
SBC access cycles                    25 
L2 access time (t2) (cycles)         10 
Global memory access cycles (tm)     100 
Bus data width                       2 bytes 
L1 relative cost                     10 
L2 relative cost                     6 
SBC relative cost                    3 


3.2. Relative normalized system cost 

Table 3 shows the normalized system cost, i.e., the total system cost that includes the normalized 
interconnection cost from Table 1 and the relative cache memory cost from Table 2. As Table 3 shows, the L2 
cache adds 2 % to the system cost, whereas the SBC adds only 0.5 %. We ran simulations using the minimum 
number of L1, L2, and SBC cache blocks needed to meet the criterion given in (5). To reduce the cache hit 
time, we used an optimal cache association while still ensuring the concurrency criterion given by (5). 


Table 3. Normalized system cost 

Cache system   No. of L1   L1            No. of L2   L2            No. of SBC   SBC           Normalized 
               blocks      association   sets        association   sets         association   system cost 
L1             128         1             -           -             -            -             1 
L12            128         1             4           2             -            -             1.02 
L1-SBC         128         1             -           -             4            2             1.005 


3.3. Cache read and write misses criticality impact 

It is well known that cache read misses are more critical and incur a higher penalty than write 
misses. To alleviate this problem, a read-write partitioning policy that minimizes read misses using 
dynamic cache management was suggested in [24]. To provide more read miss support, our approach includes 
the SBC during reads only. In general, as reads increase from 50 % to 80 % of the processor data requests, 
we found a marked improvement in SBC hit concurrency as a result of its exclusive support during the read 
cycle. However, as not all applications issue more data reads than data writes, we treat the 50 % read 
data requests case as a reasonable baseline for now, and will look at application-centric read/write 
trade-offs using novel cache protocols in the future. Novel read/write cost trade-offs have also been 
suggested recently for DNA-based data storage [25]. 


3.4. L1 and SBC hit concurrencies and miss concurrencies 

Tables 4 and 5 show the L1 cache hit concurrency (ch1), SBC hit concurrency (ch2), and miss 
concurrency in SBC (cm) for various system sizes for L1, L12 and L1-SBC systems for 50 % and 80 % read 
requests respectively. 
Table 4. Cache hit and miss concurrency with 50 % read requests 

System size          16                    32                    64 
               ch1    ch2    cm      ch1    ch2    cm      ch1    ch2    cm 
L1             0.4    -      8.3     0.4    -      16.1    0.5    -      31.9 
L12            4.5    4.2    8       8.5    8.2    15.7    16.8   16.1   31.2 
L1-SBC         4.4    5.2    7.4     8.5    11.9   12.8    16.7   31.5   16.1 


Table 5. Cache hit and miss concurrency with 80 % read requests 

System size          16                    32                    64 
               ch1    ch2    cm      ch1    ch2    cm      ch1    ch2    cm 
L1             0.4    -      8.3     0.4    -      16.1    0.5    -      31.9 
L12            1.9    6.8    TA      3      13.1   14.3    2S     25.9   29.2 
L1-SBC         2.0    8.2    6.4     3.7    19.3   10.6    6.9    50.9   16.1 


Figures 5 and 6 show the hit and miss concurrency for L1-SBC for 50 % and 80 % read requests 
respectively. For the same number of cores, the miss concurrency of L1-SBC is lower than that of L1 due to 
the higher hit concurrency in the SBC. The miss concurrency utilization in L1-SBC is about 50 % for the 
larger numbers of cores. This is because the SBC offers higher hit concurrency, which reduces memory 
traffic over the interconnection. Even though the low miss concurrency utilization may suggest that the 
number of buses could be reduced further for higher numbers of cores, doing so may decrease the SBC hit 
rate due to lower bandwidth availability, thus nullifying any overall advantage. When data reads outnumber 
data writes (80 % read requests), the SBC hit concurrency increases by approximately 1.5x for the same 
system size. 
Figure 5. Cache hit and miss concurrency for L1 with 50 % read requests 
Figure 6. Cache hit and miss concurrency for L1 with 80 % read requests 


3.5. Concurrent average memory access time (C-AMAT) cycles 
We evaluated the concurrent average memory access time (C-AMAT) cycles from (3) and (4). 
Tables 6 and 7 show the C-AMAT for 50 % and 80 % read requests respectively. 


Table 6. C-AMAT with 50 % read requests 

Cache system    16      32      64 
L1              12.4    6.4     3.2 
L12             4.1     2.2     1.1 
L1-SBC          3.7     1.6     0.3 


Table 7. C-AMAT with 80 % read requests 

Cache system    16      32      64 
L1              12.4    6.4     3.2 
L12             6.5     3.3     1.6 
L1-SBC          5.7     2.4     0.3 


As a result of increased SBC hit concurrency, the C-AMAT decreases with the number of cores. 
Figure 7 shows the C-AMAT for 50% and 80% read requests respectively. Further reduction in C-AMAT is 
seen for 80% read requests due to increase in SBC hit concurrency. 
Figure 7. C-AMAT with 50 % and 80 % read data requests 


3.6. Cache system throughput 
The throughput in GB/sec (g) is given as (7). 


$$g = \frac{2b}{\text{C-AMAT} + t_r} \qquad (7)$$
where b is the number of buses, each with a 2-byte bus data width. We assumed a GBI bus arbitration and 
bus allocation reconfiguration time (tr) of 1 cycle and a clock cycle time of 0.5 ns, which converts the 
cycle count in the denominator to time. Table 8 summarizes our results for throughput in GB/sec. We used 
the normalized unit cost from Table 3 and the C-AMAT from (3) and (4). As shown in Table 8, the throughput 
increases with the number of cores and read request percentage, suggesting a good advantage. 
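
As a worked check of the throughput expression, the following Python sketch evaluates it for the L1-SBC system. The constants come from Table 2 and the text above; the function name is hypothetical, the number of buses is taken as n/2 as stated for GBI, and the C-AMAT inputs are the rounded L1-SBC values for 50 % read requests from Table 6 (read as 3.7, 1.6 and 0.3 cycles), so small rounding differences from Table 8 are possible.

```python
CLOCK_NS = 0.5   # clock cycle time in ns (Table 2)
BUS_BYTES = 2    # bus data width in bytes (Table 2)
T_R = 1          # bus arbitration and reconfiguration time in cycles


def throughput_gb_per_s(n_cores, c_amat_cycles):
    """Throughput g = 2b / (C-AMAT + tr), with cycles converted to ns."""
    buses = n_cores // 2                          # GBI uses n/2 buses
    access_ns = (c_amat_cycles + T_R) * CLOCK_NS  # access time in ns
    return BUS_BYTES * buses / access_ns          # bytes per ns equals GB/s


# L1-SBC C-AMAT cycles for 50 % read requests, as read from Table 6.
c_amat_50 = {16: 3.7, 32: 1.6, 64: 0.3}
for n, c in c_amat_50.items():
    print(n, round(throughput_gb_per_s(n, c), 1))
```

Because one byte per nanosecond equals one GB/sec, no further unit conversion is needed once the cycle count is multiplied by the clock period.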


Table 8. Throughput in GB/sec for L1-SBC for 50 % and 80 % read requests 

No. of cores      16              32              64 
Read requests     50%     80%     50%     80%     50%     80% 
Throughput        6.8     4.8     24.6    18.8    98.5    95.5 
Figure 8 shows the throughput for 50 % and 80 % read requests respectively. Figure 9 shows the 
average throughput improvement factor of the L12 and L1-SBC cache systems over the L1 cache system. We 
found that the average throughput improvement factor of the L12 cache system across all system sizes is 
1.5x for 50 % read requests and 1.8x for 80 % read requests compared to L1. For the L1-SBC memory system, 
the average throughput improvement is 2.5x for 50 % read requests and 2.4x for 80 % read requests compared 
to the L1 system. As the cost increase of L1-SBC over L1 is negligible (0.5 %), we conclude that the 
L1-SBC cache is both cost and performance efficient compared to the L1 or L12 cache systems. L1-SBC offers 
a 30 % to 60 % higher throughput improvement factor over L1 than L12 does. 
Figure 8. L1-SBC throughput in GB/sec for 50 % and 80 % read data requests 
Figure 9. Average improvement in throughput for L12 and L1-SBC compared to L1 


3.7. Cache system throughput with single bus fault 

We also ran simulations of L1-SBC with a single bus fault in the system. We considered both critical 
and non-critical buses when assigning the faulty bus. A bus is a “critical bus” if some memory is connected 
only to that bus. Typically, the rhombic interconnection [11] has a single “critical bus”. With GBI [12], 
however, we provided redundant bus paths, making all buses “non-critical”. Figure 10 shows the percentage 
throughput degradation of a system with a single faulty bus compared to the normal L1-SBC system with 50 % 
read requests. The degradation in throughput for a single bus fault is less than 5 % across all system 
sizes and decreases with a higher number of cores. This suggests good fault tolerance for L1-SBC as the 
number of cores increases. 
Figure 10. Throughput degradation with a single bus fault for 50 % read requests 


4. CONCLUSION AND FUTURE RESEARCH 

Many-core based heterogeneous systems demand high system throughput for big data applications and 
other compute-intensive embedded applications. By adding a less expensive SBC alongside the expensive 
per-core L1 private cache within a multi-level cache hierarchy, we can achieve higher system throughput. 
For better accuracy, we extracted the cache hit and miss concurrencies at each level and applied concurrent 
average memory access time to the L1, L12 and L1-SBC systems, all of which we simulated. Our simulation 
results indicate that L1-SBC achieves a 2.5x throughput improvement compared to using only the L1 private 
cache, and that L1-SBC offers a higher throughput improvement factor than L12 at a negligible increase in 
cost over L1. We also determined that the throughput degradation of L1-SBC with a single bus fault is less 
than 5 % across all system sizes, and that this degradation decreases as the system size increases, 
suggesting an advantage for higher numbers of cores. As we used the SBC only during read requests, in the 
future we hope to develop additional novel SBC cache protocols using exclusive and shared modes and to 
include the SBC in both read and write cycles. We also hope to run heterogeneous computing big data 
application benchmarks on the LRU L1-SBC system and assess the overall system performance. 